System and Method for Retrieving and Organizing Information From Disparate Computer Network Information Services

ABSTRACT

A system and method is provided for accessing information from a plurality of searchable information sources. The method includes the steps of: analyzing a user search query to determine a subject matter of the query; and selecting a sub-set of information from the plurality of information sources based upon the determined subject matter of the query. In a further detailed embodiment, the analyzing step combines at least two methods of deriving the subject matter from the search query; and the method further includes the step of searching the information source(s) in the sub-set of information sources, substantially in parallel, for documents relevant to the search query. A system and method is also provided for searching a plurality of searchable information sources, where the information sources include at least one secure source. This method includes the steps of: (a) storing security credentials necessary for accessing the secure source; (b) accessing the secure source utilizing the stored security credentials; (c) accessing a non-secure source; (d) searching the accessed sources, substantially in parallel, for documents relevant to a search query; and (e) displaying results of the searching step.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is a division of U.S. patent application Ser. No. 10/738,554, filed 3 Mar. 2003, entitled SYSTEM AND METHOD FOR RETRIEVING AND ORGANIZING INFORMATION FROM DISPARATE COMPUTER NETWORK INFORMATION SOURCES; which claims the benefit from U.S. Provisional Patent Application Ser. No. 60/360,754, filed Mar. 1, 2002 entitled SYSTEM AND METHOD FOR RETRIEVING AND ORGANIZING INFORMATION FROM DISPARATE COMPUTER NETWORK INFORMATION SOURCES; the contents of which are incorporated herein by reference.

BACKGROUND

The present invention is a computerized system and method for searching through and retrieving information from a plurality of information sources; and more particularly, the present invention is an enterprise-scale system and method for searching for and retrieving information from a plurality of disparate electronic information sources within a large computer network and/or from the Internet.

A federated search system, by its very definition, distributes search queries in real-time to the information sources selected for querying. In a very large scale federated search system, one that involves hundreds or even thousands of information sources, the method of real-time querying of large numbers of information sources becomes impractical. It is desired to bring some intelligence to the search process that would permit an appropriate subset of the information sources to be selected for querying rather than all the available sources.

Secure information sources within a federated search system also pose a unique set of challenges. At a fundamental level, the federated search system should be able to proxy the user credentials to a secure information source (i.e., make it appear to the secure information source that the user was natively interacting with it). This is complicated, however, by the following circumstances: multiple secure information sources could be in the searching mix at the same time; each secure information source could require different methods for handling security (this can include LDAP, HTTP-basic authentication, HTTPS, cookie-based authentication using custom forms, proprietary single-sign-ons, etc.); and the system should transparently handle the security log-ins, parameters and protocols for multiple users, possibly accessing multiple secure information sources at the same time.

Finally, in a large federated search system, a reasonable effort could involve manually creating brokers (sometimes referred to as “wrappers”) to define and interface between the system and the respective multiple searchable information sources accessed by the system. It is desired to reduce user interaction needed to create and maintain the brokers by providing an automated, or semi-automated broker generation capability.

SUMMARY

The present invention provides an enterprise-scale system and method for searching and retrieving electronic information from disparate electronic information sources within a large organization (an intranet) and/or from the Internet. At the heart of the system is a “federated search” architecture and system that enables a single search query from a user to be delivered in real-time to various selected islands of information. Depending upon the embodiment, the system can collate results, remove duplicates and dead-links, apply composite relevance scoring, and deliver the relevant results to the user.

In an exemplary embodiment, each island of information is a searchable source that is represented in the system by a “broker”, which defines how the system accesses the respective information source and how the system handles the interface between the system and the information source. Further, in the exemplary embodiment, a broker-definition tool referred to as the “agent development kit” (ADK) is used to create the brokers in a semi-automated fashion (and, possibly, a completely automated fashion) and deploy them to the live, operational system.

The exemplary embodiment of the system and method of the present invention also provides a technique, referred to as “adaptive search”, which intelligently selects subsets of information sources (from a body of available information sources) to route search queries to in the large federated search scenario. The selection of sources is based upon an analysis of the subject matter of the query. The search in this selected subset of information sources can occur automatically, or the user can be provided the option to have the search run in this subset of information sources (when the general search results are displayed, for example).

This adaptive search function is facilitated, in the exemplary embodiment, by the use of a knowledge-base (also referred to as a “subject taxonomy”), which is a hierarchical arrangement of subjects, where each subject is represented by a “fingerprint” of information that will typically be found in documents specific to such subjects. These fingerprints can be generated from example documents provided for each of the subjects in the taxonomy. Subjects in the subject taxonomy can also be linked to entity lists, which provide a list of names, symbols or other terms typically associated with a respective subject. By comparing the search query against the subject taxonomy and/or the entity lists, the subject matter of the search query can be determined within a desired level of confidence.

The exemplary embodiment of the present invention also utilizes a comprehensive, multi-user, multi-source, multi-modal security handling architecture to allow users to query open sources (non-secure sources) as well as secure sources simultaneously in a substantially transparent fashion. Additionally, the exemplary embodiment of the present invention provides a methodology to incorporate the security handling protocols and parameters into the definitions of the brokers, again, in a semi-automated fashion.

Therefore, it is a first aspect of the present invention to provide a computer implemented method for accessing information from a plurality of searchable information sources. The method includes the steps of: (a) analyzing a user search query to determine a subject matter of the query; and (b) selecting a sub-set of information from the plurality of information sources based upon the determined subject matter of the query. In a detailed embodiment, the analyzing step combines at least two different methods of deriving a subject matter from the search query. In a further detailed embodiment, the method further includes the step of (c) searching at least one information source in the sub-set of information sources for documents relevant to the search query. In another alternate detailed embodiment, one deriving method of the analyzing step includes the step of comparing at least a portion of the search query against a plurality of entity lists, where each entity list includes a list of phrases, and where each of the phrases corresponds with one or more subject matters; and the comparing step includes the step of matching the phrase in an entity list against at least a portion of the search query, and upon such match, returning a subject matter corresponding to the matched phrase in the entity list.
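By way of illustration only, the entity-list comparison described above might be sketched in Python as follows. The entity lists, their phrases, the subject names and the function name are all hypothetical and are not taken from the exemplary embodiment; the whole-word matching rule is likewise an assumption.

```python
import re

# Hypothetical entity lists: each phrase maps to one or more subject matters.
ENTITY_LISTS = {
    "companies": {"acme corp": ["business"], "globex": ["business", "technology"]},
    "diseases": {"pancreatic cancer": ["health"], "diabetes": ["health"]},
}


def subjects_from_entity_lists(query, entity_lists=ENTITY_LISTS):
    """Return the subject matters whose entity-list phrases occur in the query."""
    # Normalize the query to lower-case words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", query.lower()))
    subjects = []
    for phrases in entity_lists.values():
        for phrase, phrase_subjects in phrases.items():
            # Match on word boundaries so "globex" does not match "globexpo".
            if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
                for subject in phrase_subjects:
                    if subject not in subjects:
                        subjects.append(subject)
    return subjects
```

In this sketch a query such as "Pancreatic cancer treatment protocol" matches the phrase "pancreatic cancer" and therefore returns the subject matter linked to that phrase.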

In yet another alternate embodiment of the first aspect of the present invention, one deriving method of the analyzing step includes the step of comparing the search query against a knowledge base, where the knowledge base includes a taxonomy of subject matters and a set of terms for at least some of the respective subject matters in the taxonomy, where the set of terms represent information likely to be found for the respective subject matters; and the comparing step compares at least portions of the search query against the set of terms in the knowledge base to determine the respective subject matters of the matching terms. In a further detailed embodiment, the method further includes the step of building the knowledge base, where the building step includes the steps of: (i) defining a taxonomy of subject matters; (ii) for at least some of the subject matters in the taxonomy, providing at least one example document that represents content typically found for the respective subject matter; (iii) generating a set of terms from the example document; and (iv) linking the set of terms to the respective subject matter. In yet a further detailed embodiment, the taxonomy is structured as a multi-tier hierarchy. In an alternate detailed embodiment, the step of comparing the search query against the knowledge-base further includes a step of assigning a score to the determined subject matter based upon a confidence level of the comparison. In yet a further detailed embodiment, the step of determining a subject matter of the query further includes the steps of displaying one or more of the subject matters having a score greater than a predetermined threshold and selecting, by a user, at least one of the displayed subject matters. In yet another alternate detailed embodiment, the analyzing step determines a plurality of the subject matters, and the method further includes a step of organizing the determined plurality of subject matters according, at least in part, to the scores assigned to the plurality of subject matters.
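The knowledge-base building and scoring steps above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a subject's set of terms is simply the most frequent words in its example documents and that the confidence score is the fraction of query terms found in that set; the function names, the `top_n` and `threshold` parameters and the scoring rule are hypothetical and are not the method of the exemplary embodiment.

```python
import re
from collections import Counter


def terms_of(text):
    """Lower-case word tokens of at least three letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def build_knowledge_base(examples, top_n=10):
    """Map each subject to a set of its most frequent example-document terms."""
    knowledge_base = {}
    for subject, documents in examples.items():
        counts = Counter()
        for document in documents:
            counts.update(terms_of(document))
        knowledge_base[subject] = {term for term, _ in counts.most_common(top_n)}
    return knowledge_base


def score_subjects(query, knowledge_base, threshold=0.5):
    """Score each subject by the fraction of query terms found in its term set.

    Only subjects scoring above the threshold are returned, best first.
    """
    query_terms = set(terms_of(query))
    if not query_terms:
        return []
    scored = []
    for subject, term_set in knowledge_base.items():
        score = len(query_terms & term_set) / len(query_terms)
        if score > threshold:
            scored.append((subject, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The ordered list returned by `score_subjects` corresponds loosely to the subject matters that could be displayed to a user for selection once their scores exceed the predetermined threshold.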

In yet another alternate detailed embodiment of the first aspect of the present invention, the step of selecting a sub-set of information sources includes the steps of (i) providing a category-to-source map that includes a plurality of categories, where the categories have at least one information source linked thereto, (ii) obtaining at least one category pertaining to the subject matter of the query, and (iii) adding the information source linked to the category in the category-to-source map to the sub-set of information sources. In a further detailed embodiment, each information source is assigned a performance score pertaining to at least one performance quality of the information source. In yet a further detailed embodiment, the method further includes the steps of searching at least one information source in the sub-set of information sources for document(s) relevant to the search query and displaying the search results from the output of the searching step, where the displaying step displays the search results in an order based upon, at least in part, the performance scores of the information sources from which the search results are obtained. In an alternate detailed embodiment, the performance quality is based upon the frequency that the respective information source is accessed, the amount of time spent accessing the respective information source, the frequency of problems accessing the respective information source, and/or feedback provided by users of the respective information source. In yet a further alternate detailed embodiment, the method further includes the step of eliminating from the sub-set of information sources any information source having a performance score lower than a predetermined threshold.
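The category-to-source selection and the performance-score threshold described above might look like the following Python sketch. The map contents, the numeric scores, the 0.0 to 1.0 scale and the function name are assumptions made for illustration; only the source names are borrowed from the examples given elsewhere in this description.

```python
# Hypothetical category-to-source map and performance scores (0.0 to 1.0).
CATEGORY_TO_SOURCES = {
    "health": ["WebMD.com", "Intellihealth.com"],
    "business": ["WSJ.com"],
}
PERFORMANCE = {"WebMD.com": 0.9, "Intellihealth.com": 0.3, "WSJ.com": 0.8}


def select_sources(categories, min_score=0.5,
                   category_map=CATEGORY_TO_SOURCES, performance=PERFORMANCE):
    """Collect sources linked to the categories, drop low scorers, best first."""
    selected = []
    for category in categories:
        for source in category_map.get(category, []):
            # Eliminate any source whose score is below the threshold.
            if source not in selected and performance.get(source, 0.0) >= min_score:
                selected.append(source)
    return sorted(selected, key=lambda s: performance.get(s, 0.0), reverse=True)
```

Ordering the selected sources by score is one simple way to let the performance scores influence the order in which results are later displayed.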

In an alternate detailed embodiment of the first aspect of the present invention, the method further includes the steps of (c) assigning each information source in the sub-set of information sources a performance score pertaining to performance qualities of the information source; (d) searching the information sources in the sub-set of information sources for documents relevant to the search query; and (e) displaying search results from the output of the searching step, where the search results are ordered based upon, at least in part, the performance scores of the information sources from which the search results are obtained. In a further detailed embodiment, the performance scores are calculated based, at least in part, upon the number of times the respective information source is accessed by a community of users.

In yet another alternate detailed embodiment of the first aspect of the present invention, the method further includes the steps of (c) searching the information sources in the sub-set of information sources for documents relevant to the search query; and (d) displaying the search results from the output of the searching step, where the search results are segregated for each of the information sources in the sub-set of information sources. In a further detailed embodiment, the searching step searches the information sources in the sub-set of information sources substantially in parallel and the displaying step displays the segregated searches in parallel.

In yet a further detailed embodiment of the first aspect of the present invention, the method further includes the steps of: (c) searching a standard information source (such as the World Wide Web) for documents relevant to the search query; and (d) displaying the results of the step of searching the standard information source along with an option, selectable by the user, for searching the sub-set of information sources for documents relevant to the search query upon selection of the option by the user. As mentioned above, this standard information source could be the World Wide Web and further, the sub-set of information sources may be maintained, for example, on a private computer network. In a further detailed embodiment, the analyzing step determines a plurality of subject matters from the query, the selecting step selects a sub-set of information sources for each of the plurality of the subject matters determined in the analyzing step, the displaying step displays the plurality of options for each subject matter determined in the analyzing step, where each option is identified by its respective subject matter in the displaying step and where each option is provided for searching the sub-set of information sources associated therewith for documents relevant to the search query upon selection of the option by the user.

In yet a further detailed embodiment of the first aspect of the present invention, the method further includes the steps of (c) searching a standard information source for documents relevant to the search query, (d) searching the sub-set of information sources for documents relevant to the search query, and (e) simultaneously displaying the results of the step of searching the standard information source and the step of searching the sub-set of information sources. In a further detailed embodiment, the displaying step segregates the results of the step of searching the standard information source from the step of searching the sub-set of information sources.

In yet a further detailed embodiment of the first aspect of the present invention, the analyzing step determines a plurality of subject matters from the query, and the selecting step selects a sub-set of information sources for each of the plurality of subject matters determined in the analyzing step. In a further detailed embodiment, the method further includes the step of automatically searching the sub-set of information sources associated with the subject matter having the closest match to the search query for documents relevant to the search query.

It is a second aspect of the present invention to provide a computer-implemented method for searching a plurality of information sources, where the information sources include at least one secure source. This method includes the steps of: (a) storing security credentials necessary for accessing the secure source; (b) accessing the secure source utilizing the stored security credentials; (c) accessing a non-secure source; (d) searching the accessed sources, substantially in parallel, for documents relevant to a search query; and (e) displaying results of the searching step. In a further detailed embodiment, the plurality of information sources includes a plurality of secure sources, the step of storing security credentials includes the step of storing respective security credentials necessary for accessing each secure source, and the step of accessing the secure source involves the step of accessing the plurality of secure sources, substantially in parallel, using the respective stored security credentials. In yet a further detailed embodiment, the method operates on a computer network system having a plurality of users and the step of storing security credentials includes the step of storing respective security credentials for accessing each secure server by each user of the computer network system. In an alternate detailed embodiment, the security credentials are stored in a database that includes a table for each user, where each table includes a set of respective security credentials for accessing each secure source by each respective user. It is within the scope of the invention that at least certain of the security credentials may be shared by certain users (or groups of users) during the accessing and/or searching steps.
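As a rough illustration of steps (a) through (d) above, the following Python sketch keeps a per-user credential store in memory and queries secure and non-secure sources in parallel with a thread pool. The store layout, the source dictionaries and the `search_source` stand-in are hypothetical; a real system would replace `search_source` with the network access and log-in handling appropriate to each source, and would not hold passwords in plain text.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-user credential store: user -> source name -> credentials.
CREDENTIALS = {"alice": {"journal": {"username": "alice", "password": "s3cret"}}}


def search_source(source, query, credentials=None):
    """Stand-in for a real source query; returns labelled placeholder results."""
    if source["secure"] and credentials is None:
        return []  # no stored credentials, so the secure source yields nothing
    return [f"{source['name']}: result for '{query}'"]


def federated_search(user, query, sources, store=CREDENTIALS):
    """Query all sources in parallel, passing stored credentials to secure ones."""
    def run(source):
        creds = store.get(user, {}).get(source["name"]) if source["secure"] else None
        return source["name"], search_source(source, query, creds)

    # One worker per source, so the sources are searched substantially in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        return dict(pool.map(run, sources))
```

Because the results are keyed by source name, they remain segregated per source and can be displayed that way if desired.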

In an alternate detailed embodiment of the second aspect of the present invention, the step of storing security credentials includes the steps of recording a user's security credentials as the user preliminarily enters the secure source and storing the recorded user's security credentials for the step of accessing the secure server. In yet a further detailed embodiment, the stored user's security credentials are reusable for multiple steps of accessing the secured server. In an alternate detailed embodiment, the security credentials are used substantially transparently to the user during the step of accessing the secure server. In yet another alternate detailed embodiment, the step of accessing the secure source further includes the step of storing session cookies set by the source for the duration of the search process.

It is a third aspect of the present invention to provide a computer-implemented method for searching a plurality of searchable information sources by a plurality of users to a computer network system, where the information sources include at least one secure source. The method includes the steps of: (a) for each user, storing security credentials necessary for accessing the secure source; (b) accessing, by each user, the secure source utilizing the stored security credentials for each user; and (c) searching the accessed secure source, by the plurality of users, substantially in parallel, for documents relevant to one or more search queries. In a further detailed embodiment, the method further includes the step of (d) creating a session record for each user accessing the secure source. In a further detailed embodiment, the session record includes cookies, session parameters, session IDs, and/or a session state. In yet a further detailed embodiment, the information sources include a plurality of secure sources, the storing step includes the step of storing, for each user, security credentials necessary for accessing one or more of the plurality of the secure sources, the accessing step includes the step of accessing, by each user, one or more of the plurality of secure sources utilizing the stored security credentials for each user, and the searching step includes the step of searching the accessed secure sources, by the plurality of users, for documents relevant to one or more search queries. In yet a further detailed embodiment, a session record is created each time a user accesses a secure source.

In an alternate detailed embodiment of the third aspect of the present invention, the information sources include a plurality of secure sources, the storing step includes the step of storing, for each user, security credentials necessary for accessing one or more of the plurality of secure sources, the accessing step includes the step of accessing, by each user, one or more of the plurality of secure sources utilizing the stored security credentials for each user, and the searching step includes the step of searching the accessed secured sources, by the plurality of users, for documents relevant to one or more search queries.

It is a fourth aspect of the present invention to provide a computer implemented method for generating searchable source brokers for defining interface parameters specific to each of the searchable sources. The method includes the steps of: (a) accessing a given searchable source; (b) performing an example search on the given searchable source to produce search results by that searchable source; and (c) identifying regular expressions from the search results. In a further detailed embodiment, the method further includes the step of storing the regular expressions for the given searchable source for subsequent reuse by a federated search system. In a further detailed embodiment, the step of identifying regular expressions is performed substantially automatically, the method further includes the step of reviewing, by a user, output of applying the regular expressions to search results produced by the given searchable source, and the method further includes the step of approving by the user the regular expressions based upon the reviewing step. In a further detailed embodiment, the method further includes a step of modifying the regular expressions by the user before the approving step, if the user determines the modifying step is necessary based upon the reviewing step. In an alternative detailed embodiment, the reviewing step involves the step of simultaneously displaying to the user the search results produced by the given search and the output of applying the regular expressions to the search results.

In an alternate detailed embodiment of the fourth aspect of the present invention, the step of identifying regular expressions includes the steps of: (i) parsing the search results to distill a structure of the search results; (ii) identifying repeating blocks of information from the parsed search results; (iii) identifying essential search-result elements from the repeating blocks of information; and (iv) generating a regular expression for each identified essential search-result element and a regular expression for the repeating block. In a further detailed embodiment, the essential search-result elements include a title, a URL, a date, keywords, a summary, a passage, and/or a score.
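The following Python sketch does not attempt the automatic identification of steps (i) through (iv); it only illustrates what the output of step (iv) might look like and how it could be applied to a results page. The broker layout, the HTML structure it assumes and the particular patterns are hypothetical.

```python
import re

# Hypothetical broker definition: one pattern for the repeating block and
# one pattern per essential search-result element inside a block.
BROKER = {
    "block": r'<div class="result">(.*?)</div>',
    "fields": {
        "url": r'<a href="([^"]+)">',
        "title": r"<a [^>]*>(.*?)</a>",
        "summary": r"<p>(.*?)</p>",
    },
}


def extract_results(html, broker=BROKER):
    """Apply the block pattern, then each element pattern within each block."""
    records = []
    for block in re.findall(broker["block"], html, flags=re.DOTALL):
        record = {}
        for name, pattern in broker["fields"].items():
            match = re.search(pattern, block, flags=re.DOTALL)
            # A missing element is recorded as None rather than dropped.
            record[name] = match.group(1).strip() if match else None
        records.append(record)
    return records
```

Showing the raw page next to the records returned by `extract_results` is one way a reviewer could check whether the stored patterns need modifying before approval.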

It is a fifth aspect of the present invention to provide a computer implemented method for accessing information from a plurality of searchable information sources. The method includes the steps of: analyzing a user's search query to determine a subject matter of the query; selecting a subset of information sources from the plurality of information sources based upon the determined subject matter of the query, wherein at least one of the subset of information sources is a secure information source; accessing the secure information source utilizing stored security credentials for the information source; and searching the information sources in the subset of information sources for documents relevant to the search query. In a more detailed embodiment the searching step involves the step of searching the information sources in the subset of information sources, substantially in parallel, for documents relevant to the query. In an alternate detailed embodiment, the step of accessing the secure information source utilizes the stored security credentials substantially automatically and substantially transparently to the user.

In another alternate detailed embodiment of the fifth aspect of the present invention the step of searching the information sources in the subset of information sources utilizes source brokers for each of the information sources in the subset of information sources, where the source brokers define patterns of search-result information specific to their respective information source. In a further detailed embodiment, the source broker for the secure information source includes the stored security credentials utilized in the accessing step. In an alternate detailed embodiment, the method further includes the step of defining the source broker for each of the information sources in the subset of information sources. In a further detailed embodiment, the defining step includes the steps of: preliminarily accessing the respective information source; preliminarily performing an example search on the respective information source to produce example search results; identifying regular expressions from the example search results; and storing the regular expressions as at least part of the source broker. In a further detailed embodiment, the defining step further includes the steps of detecting whether the respective information source is a secure information source, and if the detecting step determines that the respective information source is a secure information source, performing the additional steps of: providing a log-in form for the secure information source; logging into the secure information source by entering the appropriate log-in information to the log-in form by the user; recording security credential information provided by the user during the logging step; and storing the security credential information with the respective source broker.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a general block diagram of the system architecture of the exemplary embodiment of the present invention;

FIG. 2 is an example screen-shot illustrating the universal search interface of the exemplary embodiment;

FIG. 3 is an example screen-shot of the exemplary embodiment illustrating intelligent source selection capabilities combined with general search results;

FIG. 4 is an example screen shot of the exemplary embodiment illustrating intelligent query routing to subsets of information sources with general search results;

FIG. 5 is an illustration of a source taxonomy, an organization of searchable information sources in XML format, of the exemplary embodiment;

FIG. 6 is an illustration of a structure of mapping subjects to information sources, maintained in an XML file, of the exemplary embodiment;

FIG. 7 is a block diagram representation of the interplay between the intelligent source selection, federated search and adaptive learner functions of the exemplary embodiment;

FIG. 8 is an illustration of example entity lists according to the exemplary embodiment;

FIG. 9 is an illustration of a structure of mapping entity lists to information sources, maintained in an XML file, of the exemplary embodiment;

FIG. 10 is an illustration of a subject taxonomy and example documents linked to the subjects in the subject taxonomy, according to the exemplary embodiment;

FIG. 11 is an illustration of the information performance source ranking structure according to the exemplary embodiment;

FIG. 12 is an example output representation of a broker definition generated by the broker-definition tool according to the exemplary embodiment;

FIG. 13 is a block diagram representation of the multi-user, multi-source, multi-modal secure information source architecture according to the exemplary embodiment;

FIG. 14 is an illustration of a session-details database utilized by the exemplary embodiment for secure information source handling;

FIG. 15 is an example screen shot of the exemplary embodiment illustrating an initial stage of the broker-definition tool;

FIG. 16 is an example screen shot of the exemplary embodiment illustrating an extraction testing stage of the broker-definition tool;

FIG. 17 is an example screen shot of the exemplary embodiment illustrating a query definition stage of the broker-definition tool;

FIG. 18 is an example screen shot of the exemplary embodiment illustrating another testing stage of the broker-definition tool; and

FIG. 19 is an example screen shot of the exemplary embodiment illustrating a secure source handling stage of the broker-definition tool.

DETAILED DESCRIPTION

The present invention provides an enterprise-scale system and method for searching and retrieving electronic information from disparate electronic information sources within a large organization (an intranet) and/or from the Internet. At the heart of the system is a “federated search” architecture and system that enables a single search query from a user to be delivered (preferably, in real-time) to various searchable information sources.

As used herein, “information source”, “source” and “searchable information source” pertain to searchable information sources accessible over a data network such as, for example, the World Wide Web or a proprietary computer network. The searchable information sources will typically be search engines, or may include search engines or search capabilities associated therewith that provide the ability for a user to search the searchable information source for desired information. It is not necessary, however, for the searchable information source to have its own search capabilities embedded therein or associated therewith, as such search capabilities can be provided elsewhere. Examples of such searchable information sources accessible over the World Wide Web include MSN.com, LYCOS.com, TEOMA.com, Intellihealth.com, WebMD.com, WSJ.com, etc. Likewise, “secure information source”, “secure source” and “secure searchable information source” pertain to such searchable information sources that require certain security credentials (such as passwords, for example) to access and/or perform searches therein/therewith.

As used herein, “search query” and “query” pertain to an expression of the information that a user or system wishes or requests to search for in one or more searchable information sources. While the expression will typically be in the form of a term or phrase typed into a field of an electronic form by the user, it is within the scope of the invention that the expression be automatically generated and presented to the searchable information source(s) and it is within the scope of the invention that the expression be pre-stored and presented to the searchable information source(s).

As used herein, “document” means an electronic body or collection of information or data that the user or system will typically be provided access to by the searchable information source(s) in the search result(s) provided by the searchable information source(s) (although some searchable information sources only identify the documents, without providing access). This is typically the body or collection of information or data that the user/system is ultimately seeking in the searching process.

As used herein, the act of “searching” an information source or within an information source, and the act of “searching by” an information source pertains to the act of applying the search query to one or more of the searchable information sources to produce search results, which may or may not provide the user/system access to documents; but which will usually provide at least the identity of document(s) if the search is successful. It is to be understood that the present invention is not limited to any specific searching algorithm or technique.

As used herein, the act of “comparing” or “matching” a search query (or any other expression of information/data) against another expression of information/data pertains to the use of any available techniques and/or algorithms to perform a lexical comparison of the expression (or a portion of the expression) against terms, phrases or other expressions of information or data in the other entity. The results of this comparison often do not necessitate exact matches to be considered “successful”; and, thus, often include confidence scores with the results that indicate the relative confidence or closeness of the comparison. While the exemplary embodiments herein often refer to lexical comparisons, it is within the scope of the invention that alternate techniques/algorithms be used when the comparison is not a language-based comparison.

FIG. 1 provides a functional flow diagram representation of the federated search system deployed according to an exemplary embodiment of the present invention. The searching function 10 provides a configurable, hierarchically organized group of information sources, described below, to users to fulfill different information needs and requests from the multiple groups of users. A simple search involves taking search query terms from the user to conduct the search. An advanced search enables users to select multiple groups of sources, or multiple sources within a group, and to control many settings, including the depth of the search, analysis options, and time-outs. Personalized searching preferences are stored for each user by the system. Searches initiated in the system are conducted in real-time, and results are displayed in configurable web page format or in XML format.

In the intelligent source selection function 12, a user's search query is analyzed to determine the subject matter corresponding to the user's query. Upon identifying this subject matter of the search query, a sub-set of information sources can be isolated from the vast body of information sources to perform the search. For example, a search for “pancreatic cancer treatment protocol” can be determined by the system to be broadly based on the subject-heading of “health”, and more specifically, on the specific subjects of “diseases and conditions”, and “endocrinal disorders”. The sub-set of information sources is selected by consulting an information source hierarchy, or subject-to-source map, to find the best sources for the identified subject matters. These best performing sources can automatically be given preference for searching in real-time in addition to user-selected information sources, or these best sources can be offered as recommendations to the user for performing further related searches.

The federated searching function 14 implements the actual real-time, distributed searching mechanism. This function receives as inputs the search query parameters and other optional advanced settings, and accesses one or more groups of information sources to perform the federated searching in all or certain subsets of the information sources. Information sources from which the real-time federated searching may be conducted include visible Web sources 16 accessible over the Internet, invisible Web sources 18 accessible over the Internet, enterprise sources 20 (private information sources accessible by the system over the system's intranet, for example), and subscription sources 22, which may be accessible over the Internet or through separate network connections. Each information source in the sub-set of selected information sources is searched by the system in real-time, with user credentials being transparently proxied, if necessary, to each secure source 22. Multi-processing and multi-threading mechanisms are implemented for scalability to large numbers of concurrently searched sources as well as large numbers of concurrent users searching with the system. This federated searching function 14 translates a user's search query into the native forms required for each information source, communicates with each information source using native protocols and methods, navigates through one or more search result sets from each source, extracts search result records including uniquely defined fields of information for each of the records from each source, normalizes the results, removes duplicates, and performs composite relevance ranking based upon specified, configurable relevance ranking criteria. An XML result stream is produced that can be operated upon by other components in the system.
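By way of a non-limiting illustration, the parallel search, duplicate removal, and composite ranking steps described above might be sketched as follows. The function name `federated_search`, the record fields (`url`, `title`, `score`), the URL-based duplicate key, and the "keep the highest score" ranking rule are all assumptions made for this sketch; the specification does not prescribe them, and query translation and result-page navigation are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, sources):
    """Search every source in parallel, merge duplicates, rank by score.

    `sources` maps a source name to a callable taking the query and
    returning a list of dicts with "url", "title" and "score" keys
    (a hypothetical normalized record layout).
    """
    merged = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit one search task per source so they run concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                records = future.result()
            except Exception:
                continue  # a failing source does not abort the whole search
            for rec in records:
                # Assumed duplicate key: the URL, lower-cased, no trailing slash.
                key = rec["url"].rstrip("/").lower()
                entry = merged.setdefault(
                    key,
                    {"url": rec["url"], "title": rec["title"],
                     "score": 0.0, "sources": []},
                )
                entry["score"] = max(entry["score"], rec["score"])
                entry["sources"].append(name)
    # Composite ranking: highest score first.
    return sorted(merged.values(), key=lambda r: r["score"], reverse=True)
```

Each merged record keeps the list of sources that returned it, which is the kind of information a later click-through step could use.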

The analysis/filtering function 24 is optionally triggered by the user to perform real-time retrieval and analysis of the full-text contents (documents) for each result from the composite result set delivered from the federated searching function 14. Each “document” is retrieved from the corresponding information source, and the essential text content, along with relevant meta-data, is extracted from it. This function 24, in essence, “converts” content from different document formats like Adobe PDF, Microsoft Word, etc. to native text. The text and meta-data content corresponding to each result record is then passed through a real-time filtering component that takes one or more search queries representing the user's input and then determines the strength of match of the result to the user's need. In this analysis/filtering function 24, the passages (sentences or paragraphs) from the documents matching the user's query are extracted and ranked to determine the strength of the match and to compute a native “analysis score” which is used for relevance ranking purposes. Next, a dynamic summary is composed from the extracted passages for each matching document. Each result record is then enhanced with additional meta-data including an “analysis score”, an updated relevance score, a dynamic summary snippet, as well as additional information when the result document doesn't match the user's query.
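The passage extraction, ranking, and dynamic summary steps described above can be illustrated with the following minimal sketch. The specification does not define how the “analysis score” is computed; the term-overlap fraction used here, the sentence splitter, and the name `analyze_document` are assumptions chosen only to make the flow concrete.

```python
import re


def analyze_document(text, query, max_passages=2):
    """Score sentences against the query and compose a short summary.

    Assumed scoring: the fraction of query terms present in a passage.
    """
    terms = set(query.lower().split())
    # Treat each sentence as a passage (paragraphs would work the same way).
    passages = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    scored = []
    for passage in passages:
        words = set(re.findall(r"\w+", passage.lower()))
        overlap = len(terms & words) / len(terms) if terms else 0.0
        if overlap > 0:
            scored.append((overlap, passage))
    # Rank matching passages, strongest first (ties keep document order).
    scored.sort(key=lambda sp: sp[0], reverse=True)
    top = scored[:max_passages]
    return {
        "analysis_score": top[0][0] if top else 0.0,
        "summary": " ... ".join(passage for _, passage in top),
    }
```

A document with no matching passage receives a score of 0.0 and an empty summary, which a caller could use to flag results that do not match the user's query.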

The categorization function 26 categorizes the results from the federated searching (and, optionally, the analysis/filtering function 24) into a configured subject taxonomy. An administrator first creates a taxonomy of subjects representing a given information domain, provides example documents for each subject, and runs an administrative tool to train the taxonomy and create a model that is used for the real-time categorization of the search result documents. During searching, the categorization process involves deriving a “fingerprint” (important terms representative of a respective content of the record, which can be phrases or individual words) from each result record and matching it with the taxonomy model configured for use in the system. The best matching subject is determined for each result record, and is tagged as additional meta-data in the result record.

In the presentation function 28, the results from the previous steps of searching 10, analysis/filtering 24, and categorization 26 are received in XML. A standards-based template mechanism allows the results to be displayed rapidly in any desired format. Information can be organized into multiple views such as “by relevance,” “by source”, and “by concept.” The relevance view orders the results in decreasing order based upon the “relevance score”. The source view provides a graphical tree-view of the results organized by the sources from which they came. And the concept view provides a graphical tree-view of the results, organized into the matching taxonomy of subjects from the categorization process 26.

The tracking/alerts function 30 is an optional function that may be set up to run periodic searches for a given search query or set of search queries automatically and to alert the user when a desired set of results are obtained from the periodic searches, or when any results are obtained.

Referring to FIG. 2, an example screen display 32 of an initial searching screen provides a field 34 into which a searcher can enter a search query. If the user enters the search query in this field, the exemplary embodiment will perform an automatic search as described in further detail below. Optionally, prior to entering a search query, the user can select specific subjects from the source taxonomy 36 (provided in this exemplary embodiment in the form of hyperlinks) to allow the search to be performed within narrow sub-sets of information sources specific to the subject matter of interest. The taxonomy 36 in the exemplary embodiment includes an upper level of subjects 38 that generally define a subject matter and a second tier of more specific subject matters 40. As will be discussed in further detail below, upon selecting an identified subject (hyperlink) in the source taxonomy 36 displayed in the window 32, the system will then perform the searching in the specific sub-set of sources represented by the subject heading/subject 38/40 selected by the user.

As shown in FIG. 3, an example screen shot 42 is provided that illustrates the results of performing a general search of the exemplary embodiment using the search query of “cjd”. In the exemplary embodiment, if no specific subject headings 38 or subjects 40 are selected from the subject taxonomy 36, then the exemplary embodiment will perform the search set forth in the search query from a federated group of Web search engines (such as “MSN”, “LYCOS”, “TEOMA”, etc.) and display the results of the search on the screen in order based upon relevance of the documents from the search results in comparison to the subject of the search query. Additionally, the exemplary embodiment also analyzes the search query to determine a subject matter (or subject matters) of the query and provides links to the subsets of information sources 44 (in the form of hyperlinks) associated with the subject matter(s) determined from the search query above the general search results. If the user selects the identified subsets of information sources 44, the system will perform the same search in the sub-set of information sources. Exemplary methods for identifying the subject matters from the search query 34 are discussed in further detail below. In the example shown in FIG. 3, the search query “cjd” was identified by the system as being related to the specific subject matters, “Health tips”, “Health news” and “Health discussions”. The system was able to make this recommendation based upon analyzing the query and identifying that the closest subject heading that it corresponded to was health; hence, the recommendation from the system that this search be conducted within “health-related” sources.

As shown in FIG. 4, when a general search is requested, the exemplary embodiment may also be configured to automatically perform the search within a sub-set of information sources corresponding to a subject matter matching the search query. The display 46 shown in FIG. 4 illustrates that the specific search for the search query “cjd” was automatically conducted within the sub-set of information sources associated with the “Health Tips” subject matter. The search results from this specific search may come from information sources such as “American Medical Association”, “Intellihealth”, “WebMD”, etc., for the best results on the subject.

FIG. 5 illustrates the exemplary structure of the source taxonomy 36, and FIG. 6 illustrates the exemplary subject-to-source map 42 (also referred to, herein, as a category-to-source map). As discussed above, the subject-to-source map 42 is used to identify one or more information sources corresponding to identified subject matters of the search query, to allow for more focused searching of the subject matter in these sources. The subject-to-source map 42, in the exemplary embodiment, is arranged as a hierarchy that includes an upper level of subject headings 38 (such as “health”), and for each subject heading 38 there are one or more information source subsets 41 such as “health news”, “health publications”, “health tips”, and “alternative medicine” linked thereto. Finally, for each information source subset 41, there are linked to it one or more information sources 48. For example, the specific subject “health tips” will have linked to it information sources such as “American Medical Association”, “Intellihealth.com”, “WebMD.com”, etc.

FIG. 6 more specifically illustrates how a subject in the ontology is mapped to a group of sources or to a single source by an administrator in the exemplary embodiment. Health as a general subject 38 may be mapped to a group of searchable sources 40 called “health tips”. The more narrow subjects under the general subject “health”, such as “cancer”, may be mapped to specialized sources providing information on cancer treatment, cancer trials, etc. The ability to map the subject headers and specific subjects to information source(s) is completely flexible and can be tuned to the needs of the specific search scenario in which the system will be used.

Referring to FIG. 7, as discussed above, the intelligent source selection function 12 utilizes a query analysis algorithm to determine a subject matter or subject matters of the search query, where such identified subject matters are used to help the user identify specific sub-sets of information sources to perform more focused searches. Generally, the query analysis algorithm uses a combination of deterministic look-ups within a group of provided entity lists 50 along with fuzzy look-ups (“auto-categorization”) within a knowledge-base 54 to determine within a certain degree of confidence the subject matter of the query. Then, based upon the determined subject matter(s), subsets of information sources can be provided for these subject matter(s) using the subject-to-source map 42.

Examples of entity lists 50 can be found in FIG. 8. For example, an entity list can include a list of ticker symbols or an entity list can include a list of company names. Other representative entity lists could be, for example, health conditions, places, sports, etc. Generally, an entity list 50 is a list of words, names, or other terms that collectively fall under a general subject heading 38 or fall under a specific subject 40. As will be discussed in further detail below, the more general entity lists are referred to in the exemplary embodiment as “fall-through” lists (having a lower confidence level) and the more specific entity lists are referred to as non-fall-through lists (having a higher confidence level).

FIG. 9 provides an example entity list-to-source mapping 52 which maps certain entity lists directly to specific subject matters. For example, the mapping shown in FIG. 9 includes the entity list “places” mapped to the specific subject matters “maps”, “travel guides”, “almanacs”, and “encyclopedias”. Additionally, the entity list “ticker symbols” is mapped to the subjects “financial discussions”, “financial tips”, “company profiles”, “SEC filings”, and “mutual funds”.

Referring again to FIG. 7, as mentioned above, the fuzzy look-up step involves a “digital fingerprint” match of terms in the search query with the “digital fingerprints” of topics in a subject knowledge-base 54. This methodology is referred to as “auto-categorization”, emanating from the problem of trying to “automatically” find the “category” in a taxonomy that a stream of input text corresponds to. An example of the knowledge-base is shown in FIG. 10. The left pane 56 in the display illustrates a subject hierarchy labeled the “Whole Web Subject Taxonomy”. The first level 58 in the hierarchy contains the general subject headings such as “Health”, the next level 60 in the hierarchy includes more specific subject headings such as “Alternative”, “Child Health”, and “Conditions and Diseases”, and the most specific level 62 in the hierarchy includes very specific subject matters such as “Cancer”, “Cardio-Vascular Disorders”, “Communication Disorders”, “Digestive Disorders”, etc., which are specific subjects of the “Conditions and Diseases” subject heading in the second tier 60. The right pane 64 of the display provides a list of example documents 65 identified by the administrator as being relevant to the selected subject, “Digestive Disorders”, in the specific level 62 of the subject hierarchy 56.

Therefore, once the taxonomy of subjects 56 is created and example documents 65 are provided to represent content typically found for each subject, the system will then learn from these example documents to create the knowledge-base 54 of subject matter representing the ontology. In the general sense, the knowledge-base includes a list of words, phrases or other terms “learned” from the example documents provided for each subject. Generally, the methodology for “learning” from a taxonomy of subjects and example documents for each subject is based upon creating topic or subject specific “digital fingerprints” using the familiar vector-space model for analyzing and representing a body of unstructured texts. The “digital fingerprints” for topics are, in essence, weighted vectors of terms (words and phrases) that best represent information most likely to be found in those specific topics. This “digital fingerprint” information is then stored in the “subject knowledge-base” for enabling the query analysis.

More specifically, in the vector-space algorithm, a vector-space model is trained off-line by parsing the collection of example documents for each subject to generate a representative vector of terms and frequencies for that subject. In the implementation of the exemplary embodiment, the terms identified can be individual words or phrases (phrases are determined via a measure known as mutual information). Typically, the subject matter vectors are normalized in some fashion, to account for variation in the size and number of training documents. In addition, a uniqueness score is calculated for each term associated with a given subject. This uniqueness score is often referred to as “IDF” for “inverse document frequency” since one over the number of documents that a term appears in is one way to measure uniqueness. In the present exemplary embodiment, the uniqueness score is one over the total of all normalized category vector weights for that term. To classify texts, a vector-space classifier parses the text to be classified to generate the vector of terms and frequencies. This vector is compared with the vectors computed off-line for each subject matter, taking into account the uniqueness of each term. In the implementation of the exemplary embodiment, for each subject matter that has a non-zero normalized weight for all terms in the text vector, and for each term in the text, the term frequency from the text is multiplied with the normalized weight for the subject matter, then that value is multiplied by the uniqueness score for the term exponentiated by a configurable constant. These values are summed to give a score for each subject matter. The resulting values determine which subject matters best match the text.
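The training and scoring arithmetic described above may be illustrated with the following minimal sketch. It assumes single-word terms (phrase detection via mutual information is omitted), assumes that each subject vector is normalized by its total term count (the text says only that vectors are normalized "in some fashion"), and scores a subject over whichever query terms it contains. The names `train` and `classify` are hypothetical.

```python
from collections import Counter


def train(examples):
    """Build per-subject term weights and per-term uniqueness scores.

    `examples` maps a subject name to a list of tokenized documents.
    """
    weights = {}
    for subject, docs in examples.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        # Assumed normalization: divide each count by the subject's total.
        weights[subject] = {t: c / total for t, c in counts.items()}
    totals = Counter()
    for vec in weights.values():
        for term, w in vec.items():
            totals[term] += w
    # Uniqueness: one over the total of all normalized weights for the term.
    uniqueness = {t: 1.0 / s for t, s in totals.items()}
    return weights, uniqueness


def classify(tokens, weights, uniqueness, exponent=1.0):
    """Return (subject, score) pairs, best match first."""
    tf = Counter(tokens)
    scores = {}
    for subject, vec in weights.items():
        # Sum of: term frequency * subject weight * uniqueness ** exponent.
        score = sum(
            f * vec[t] * uniqueness[t] ** exponent
            for t, f in tf.items() if t in vec
        )
        if score > 0:
            scores[subject] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch a term shared by many subjects contributes less to any one of them, because its uniqueness score shrinks as its summed weight grows; the `exponent` parameter stands in for the configurable constant mentioned above.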

In the exemplary embodiment, the search query analysis program operates substantially as follows. Given a user's search query, at least portions of the search query (i.e., after possibly eliminating stock words, and/or after stemming remaining words to root form) are compared against zero or more of the entity lists 50, each of which may be stored in RAM as a dictionary. As discussed above, the general entity lists (having lower confidence levels) are designated as fall-through lists, while the more specific entity lists (having a higher confidence value) are designated as non-fall-through lists. Accordingly, the fall-through lists are assigned a confidence score of 1.0 and the non-fall-through lists are assigned confidence scores of 1.5. If the search query is matched with one or more of the non-fall-through lists, then the exemplary embodiment does not perform the “auto-categorization” of the search query; however, if not found in a non-fall-through list, then the query is compared against the “fingerprints” in the knowledge-base 54 to identify subject matters corresponding to the “fingerprint” of the search query. Any matches in this comparison will be assigned confidence levels from 0 to 1 depending upon the confidence of the match. The subject matters developed from the auto-categorization step are added to the array of subject matters developed in the comparison with the entity lists above. At this point, there exists an array of subject matters (entity list names and subject headings from the knowledge-base) along with associated confidence levels, where the array is sorted by the confidence level. Each entry in the subject matter array is linked to a sub-set of information sources using the subject-to-source map 42 as discussed above. In the exemplary embodiment, if a particular subject category from the array is not found in the subject-to-source map 42, the parent category will be checked for a sub-set of information sources. For example, if the subject matter heading “health/conditions&diseases/digestive_disorders” is not found, then a look-up will be made for “health/conditions&diseases”. This step is repeated until a sub-set of information sources is matched to the subject matter (i.e., if “health/conditions&diseases” is not matched with a sub-set of information sources, then a look-up will be made for the general heading of “health”). Thus, an array of searchable information source groups associated with the array of subject matters and associated confidence levels has been constructed.
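The query analysis flow and the parent-category fall-back just described might be sketched as follows. The confidence values 1.0 and 1.5 come from the text above; the function names, the whitespace tokenizer, the representation of an entity list as a set plus a fall-through flag, and the `auto_categorize` callable (standing in for the knowledge-base comparison) are assumptions for illustration only.

```python
def analyze_query(query, entity_lists, auto_categorize):
    """Return (subject, confidence) pairs sorted by confidence.

    `entity_lists` maps a list name to (set_of_terms, is_fall_through).
    `auto_categorize` takes the query terms and returns a list of
    (subject, confidence) pairs with confidence between 0 and 1.
    """
    terms = query.lower().split()
    subjects = []
    matched_non_fall_through = False
    for name, (entries, fall_through) in entity_lists.items():
        if any(term in entries for term in terms):
            # Fall-through lists score 1.0, non-fall-through lists 1.5.
            subjects.append((name, 1.0 if fall_through else 1.5))
            if not fall_through:
                matched_non_fall_through = True
    # Auto-categorization is skipped once a non-fall-through list matches.
    if not matched_non_fall_through:
        subjects.extend(auto_categorize(terms))
    return sorted(subjects, key=lambda s: s[1], reverse=True)


def lookup_sources(subject, subject_to_source):
    """Walk up the subject path until a mapped sub-set of sources is found."""
    path = subject
    while path:
        if path in subject_to_source:
            return subject_to_source[path]
        path = path.rpartition("/")[0]  # drop the last path segment
    return []
```

For instance, with only “health” present in the map, a look-up for “health/conditions&diseases/digestive_disorders” falls back twice and returns the sources mapped to “health”.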

Furthermore, each information source in each respective sub-set of information sources may also be ranked with respect to each other utilizing the adaptive learner function 56. Generally speaking, the adaptive learner function 56 provides a method for prioritizing the information sources by rating (in real-time) the information sources based upon the popularity of the source or upon other performance or statistical considerations (or combinations thereof) to provide performance scores 57 for the information sources. The adaptive learner process is a means to learn the on-going performance of sources (in the manner in which they return relevant results to users on various subjects), so that the intelligent source selection function 12 continually improves and keeps pace with the changing content or behavior of the individual sources. From a simplistic perspective, this method simply rates the up-to-the-minute popularity of each source for each subject in the ontology.

As shown in FIG. 11, an internal database 58 maintains an internal ranking of the performance of sources in specific subject areas. For example, the highlighted source in FIG. 11, the “Mayo health” database, has been rated as the best performing source by the system, having a performance score 57 of 0.61. Some of the performance criteria utilized in adjusting this performance score include: (a) adjusting the performance score based upon the number of times users access the source from search result listings; (b) adjusting the score based upon the amount of time spent on each source; (c) adjusting the score based upon access problems or performance of the source (such as lowering the score if users have trouble accessing the source at various times); and (d) adjusting the score based upon user feedback, such as through questionnaires or rating polls. The impact of the adaptive learner function 56 is not typically instantaneous to start with. Depending on the subject-spread of the queries being performed, the sources put to use, and the volume of users and queries, the adaptive learner process 56, over time, provides a reasonably accurate measure of the performance of specific sources on specific subjects.

As mentioned above, the adaptive learner process 56 gauges the “popularity” of a particular information source for a particular subject measured, in the exemplary embodiment, through result “click-throughs” from the community of users. The result links returned from the federated search function 14 are directed to a “click-through” handler when activated by a user. The “click-through” handler redirects the user's browser to the actual result after optionally updating the per-source category weights for the information sources that returned the result. Optionally, the per-source category weights can be adjusted by the “click-through” handler periodically (i.e., every 100th access) to reduce the rate of change. In the exemplary embodiment, each result link returned from the federated search function 14 includes the following: the original result link; a list of the information sources that returned the result; the ESS query; and a list of the subjects assigned to the search query.
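One possible shape for such a “click-through” handler is sketched below. The class name, the additive `step` increment, and the in-memory weight table are assumptions; the text specifies only that per-source category weights are updated, optionally on every Nth access, before the user is redirected.

```python
from collections import defaultdict


class ClickThroughHandler:
    """Track per-source, per-subject popularity from result click-throughs."""

    def __init__(self, update_every=1, step=0.01):
        self.weights = defaultdict(float)  # (source, subject) -> weight
        self.update_every = update_every   # e.g. 100 to update every 100th access
        self.step = step                   # assumed fixed increment per update
        self._clicks = 0

    def handle(self, result_url, sources, subjects):
        """Record a click and return the URL the browser should be sent to."""
        self._clicks += 1
        # Only adjust weights periodically to reduce the rate of change.
        if self._clicks % self.update_every == 0:
            for source in sources:
                for subject in subjects:
                    self.weights[(source, subject)] += self.step
        return result_url
```

A web front end would wrap `handle` in a redirect response; the sketch returns the target URL and leaves the HTTP layer out.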

In addition to the “click-through” handling described above, the following measures can also be used to stabilize the “learning loop”.

1. Measure the duration of time the user spent looking at/reading through a given result document and use this to discern the “usefulness” of the document to the user, and by correlation, the usefulness of the information source that returned that document for the subject corresponding to the search query;
2. Categorize the result document matched up with the subject corresponding to the user's search query; and/or
3. Assign a penalty (something that would reduce the weight value) to information sources that are periodically inaccessible or slow to respond.

Referring again to FIG. 7, the federated search function 14 performs the substantially parallel real-time searches on the plurality of information sources. The federated search function 14 utilizes brokers 66, which are electronic definitions stored on the system that define for each of the information sources to the federated search function 14 how to interface with the respective information source; for example, how the federated search function 14 is to communicate with the information source, how the federated search function is to structure its queries (in its native form) to the specific information source, how the federated search function 14 interprets results from the particular information source, how the federated search function is to navigate through multiple “pages” of the results set from the specific information source, any security methods used by the particular information source, etc. An example broker definition for the Intellihealth.com information source is provided in FIG. 12.

The present invention also makes it possible for non-operational brokers (brokers can become non-operational if the information source they correspond to ceases to exist, moves to a different location, delivers different content, delivers content in a different format, has new capabilities for search and retrieval, has new security structures, etc.) to be healed automatically through an automated background testing process.

As mentioned above, the brokers 66 can provide the security parameters and credentials necessary for the federated search system to access a secure or subscription information source or sources 22. Consequently, the present invention also provides a security handling architecture that enables the system to proxy user credentials for multiple users to multiple secure sources using multiple security methods in real-time.

As shown in FIG. 13, the multi-user, multi-source, multi-modal security architecture utilizes a security broker function 68 within the federated search system that utilizes user security information 70 and security parameters embedded within the brokers 66 to drive the multi-user session manager 72. The multi-user session manager 72 creates an active user session 74 for each secure source 22 respectively accessed by each user. Therefore, if, for example, WSJ Archives are accessed by thirty-three of the active users, then thirty-three active user sessions 74 will be created, one for each individual access.

The security broker 68 is invoked during the federated search function 14 for each secure information source in the search request. The security broker 68 examines the broker definition 66 to determine the type of authentication (e.g., basic authentication, challenge-response, log-in form, etc.) required by the secure information source 22. For secure information sources that use a log-in form, the broker definition 66 will also describe the log-in parameters used by the information source. Next, the security broker 68 retrieves the authentication credentials 70 assigned to the user for the secure information source. This information is stored in the user security database 70. Using the combined information, the security broker 68 performs the initial steps in the establishment of the per-user session and verifies that the session has been successfully initialized. If the secure information source uses session parameters, the security broker 68 extracts the parameters from the response and stores them in the respective active user session 74. From this point on, the federated search process 14 proceeds normally. If the secure information source 22 uses session parameters, the security broker 68 will be re-invoked at each step in the search process to transmit the appropriate session parameters for the respective active user session 74. As discussed above, the session manager 72 is responsible for maintaining a separate active user session 74 for each user/source combination. Separate “session parameters” are maintained by the session manager 72 for each active user session 74. FIG. 14 illustrates the conceptual organization of the internal security information structure maintained by the session manager.

As shown in FIG. 14, session parameters are stored in “Session Details” records and state is managed for each secure source searched by each user in real-time. Such session parameters may include cookies, session parameters, session IDs, session dates, etc. The session parameters will vary depending upon the mode or type of security encountered at each secure source. Using this dynamic security information structure, the session manager 72 maintains the integrity of the unique security requirements at each secure source 22 in a multi-user environment, while at the same time not compromising the privacy of each user's individual security requirements. It should be understood that it is within the scope of the invention that at least certain of the security credentials and/or session parameters may be shared by certain users (or groups of users) during the accessing and/or searching steps. These shared credentials/parameters may be included in the “Session Details” records for each user or in a shared record accessible for all users sharing the credentials/parameters.
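The per-user, per-source session bookkeeping described above can be illustrated with a minimal sketch. The class name, the in-memory dictionaries, and the `login` callable (standing in for the security broker's authentication step, whatever its type) are assumptions; shared credentials and session expiry are not modeled.

```python
class SessionManager:
    """Keep one active session record per (user, secure source) pair."""

    def __init__(self, credentials):
        # (user, source) -> stored security credentials for that user
        self.credentials = credentials
        # (user, source) -> "Session Details" record (cookies, IDs, etc.)
        self.sessions = {}

    def get_session(self, user, source, login):
        """Return the active session, logging in only on first access.

        `login(source, creds)` is a hypothetical callable that performs
        the source-specific authentication and returns session parameters.
        """
        key = (user, source)
        if key not in self.sessions:
            creds = self.credentials[key]
            self.sessions[key] = login(source, creds)
        return self.sessions[key]
```

Because the key combines user and source, thirty-three users accessing the same secure source produce thirty-three separate session records, and later steps in a search reuse the stored parameters rather than logging in again.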

FIG. 15 illustrates a visual broker-definition tool 76 (referred to as the “Agent Development Kit” or “ADK”) that provides the exemplary embodiment of the present invention with the ability to create the brokers 66 for the information sources using a substantially automated process. This broker-definition tool 76 automatically analyzes the structure and form of the search result content generated by a searchable information source to determine patterns that exist within it; and automatically generates the necessary pattern extraction logic for the broker substantially without any user involvement. The broker-definition tool 76, in this exemplary embodiment, utilizes a familiar “wizard” interface in a left pane 78 to guide the user rapidly through the broker generation process. The right pane 80 provides visual results of the information source search result output or of the broker output. As can be seen in FIG. 15, the interface pane 78 first requests that the user enter the information source address in field 82 and activate the “Capture” icon 84. In the right pane 80 the graphical interface for the selected information source is presented. Next, the interface 78 requests that the user run a search query on the requested information source and wait for the results to be displayed. As can be seen in the right pane 80 in FIG. 15, the user has requested a search for documents related to “patio furniture”. Finally, this interface 78 requests that the user enter the number of results received on the page shown in the pane 80. When these three steps are completed, the user activates the “Next” icon 86.

Referring to FIG. 16, the broker-definition tool 76 then automatically extracts search results 88 from the search results generated by the information source shown in the pane 80. The broker-definition tool accomplishes this utilizing automatic pattern detection, extraction and generation. The basis for this process can be understood by noting that virtually every searchable source provides its search output through a program-generated HTML page. Inherent in this observation is the fact that program-generated pages (especially where repeating elements are included, like a series of search results) have some pattern driving their production. This makes it possible, in most cases, to put together a methodology to find that pattern and generate logic to extract it. Consequently, the broker-definition tool extracts the search results from the information source, generally, using the following methodology: first, the HTML document corresponding to the result page shown in the pane 80 is saved locally to a file; next, the file is parsed utilizing a specialized parser that distills the “structure” of the page (locating tables, paragraphs, divisions, etc.) from the “cosmetics” of the page (what font is being used, what color is being used, where an image is inserted, etc.); next, with this distilled structure of the page, the broker-definition tool proceeds to find “blocks” of structure (paragraphs, table rows, tables, etc.) repeating some minimum number of times (the broker-definition tool takes the input provided by the user on the previous page in response to the prompt “enter the number of results you received on the page”); next, if at least some minimum number of repeating blocks are discovered, then the broker-definition tool looks to see that these blocks contain some essential elements that are typical of search results (“essential elements” are, for example, fields or entities such as a URL, i.e., a link to a detailed record; a title, i.e., a brief title of the individual result; a date; a summary; etc.); next, if the blocks have been found to contain at least some of these essential elements, the broker-definition tool proceeds to create “regular expressions” for each of these fields and one for the blocks representing the result record; with the regular expressions in place, the broker-definition tool proceeds to apply the regular expressions to the text of the original result page and extracts only those portions of the text that correspond to the result records and the fields contained within them; finally, these extracted results are then displayed in the left pane 88 as shown in FIG. 16.
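The methodology above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the broker-definition tool's actual implementation: the names `StructureParser`, `find_repeating_block`, `extract_results`, the `SKIPPED` tag set, and the sample page are all hypothetical, and Python's standard `html.parser` stands in for the “specialized parser” described above.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical set of "cosmetic" (or void) tags left out of the page skeleton.
SKIPPED = {"b", "i", "u", "em", "strong", "font", "span", "img", "br", "hr"}


class StructureParser(HTMLParser):
    """Builds a (tag, children) skeleton of the page, ignoring cosmetics."""

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            return
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def shape(node):
    """Hashable description of an element's structure (tags only)."""
    tag, children = node
    return (tag, tuple(shape(child) for child in children))


def has_link(node_shape):
    """True if the block contains a link, one of the 'essential elements'."""
    return node_shape[0] == "a" or any(has_link(c) for c in node_shape[1])


def find_repeating_block(node, min_repeats):
    """Return the tag of a sibling block repeated at least min_repeats times."""
    counts = Counter(shape(child) for child in node[1])
    for child_shape, count in counts.items():
        if count >= min_repeats and has_link(child_shape):
            return child_shape[0]
    for child in node[1]:
        found = find_repeating_block(child, min_repeats)
        if found:
            return found
    return None


def strip_tags(text):
    return re.sub(r"<[^>]+>", "", text).strip()


def extract_results(page, expected_count):
    """Detect the repeating block, generate regular expressions, apply them."""
    parser = StructureParser()
    parser.feed(page)
    tag = find_repeating_block(parser.root, expected_count)
    if tag is None:
        return []
    block_re = re.compile(rf"<{tag}\b[^>]*>(.*?)</{tag}>", re.S | re.I)
    link_re = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S | re.I)
    summary_re = re.compile(r"<p\b[^>]*>(.*?)</p>", re.S | re.I)
    results = []
    for block in block_re.findall(page):
        link = link_re.search(block)
        if not link:
            continue
        summary = summary_re.search(block)
        results.append({
            "url": link.group(1),
            "title": strip_tags(link.group(2)),
            "summary": strip_tags(summary.group(1)) if summary else "",
        })
    return results


# Hypothetical result page for a "patio furniture" search.
PAGE = """
<html><body>
<h1>Results for patio furniture</h1>
<ul>
<li><a href="/item/1"><b>Teak bench</b></a><p>Weatherproof seating.</p></li>
<li><a href="/item/2">Folding chair</a><p>Stores flat.</p></li>
<li><a href="/item/3">Patio table</a><p>Seats six.</p></li>
</ul>
</body></html>
"""

for record in extract_results(PAGE, expected_count=3):
    print(record)
```

In this sketch the user-supplied result count serves as the minimum repeat threshold, mirroring the role that input plays in the description above. The generated block expression assumes result blocks are not nested inside one another.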

A “regular expression” is a classic computer science device utilized to “extract” the desired portion of text or other information from a larger stream of text. See http://www.python.org/doc/lib/re-syntax.html or http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/js56jsrpRegExpSyntax.asp for more information on regular expressions. Typically, regular expressions have been created by advanced/power users or developers for solving information extraction problems. The broker-definition tool methodology takes this powerful method and makes it work in a simple visual interface.
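By way of a brief illustration only, the following Python sketch shows a regular expression pulling result fields out of a larger stream of text. The pattern `RECORD`, the field names, and the sample text are hypothetical and are not taken from any actual broker.

```python
import re

# Hypothetical pattern with one named group per field of a result record.
RECORD = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r"<i>(?P<date>\d{4}-\d{2}-\d{2})</i>"
)

# Hypothetical stream of text containing two result records among other text.
STREAM = (
    "Sponsored links | Help "
    '<a href="http://example.com/a">Teak bench</a> <i>2002-03-01</i> '
    "some unrelated text "
    '<a href="http://example.com/b">Patio table</a> <i>2002-02-14</i> '
    "page footer"
)

# Only the portions matching the pattern are extracted; the rest is ignored.
records = [match.groupdict() for match in RECORD.finditer(STREAM)]
for record in records:
    print(record)
```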

If the broker-definition tool 76 were successful in performing the automatic pattern detection, regular expression generation and result extraction for every single source available, then the broker generation process could indeed become 100% automatic. Nevertheless, the process is semi-automated because there are typically exceptions that cannot be dealt with automatically by the broker-definition tool, such as, for example: when unique fields of information exist within the result records (for example, a thumbnail picture, a price, a delivery date, etc., that may all be specific to a search source; these need to be specified by the user, and then the broker-definition tool can generate the expressions for them); and when the search result records vary in structure from record to record (for example, the source may optionally include a special discount price only for a few of the returned records).

FIG. 17 illustrates a source-specific “search query translation” in the interface 78 of the broker-definition tool 76 to enable universal searches to be conducted using a single query language against multiple disparate sources. As shown in FIG. 17, in the interface pane 78 the user is able to select alternate languages other than English that are supported in the information source's query field. Then, the interface provides fields 92 where the user can specify the symbols or terms used for the boolean operations of a general search tool. Therefore, the system of the exemplary embodiment of the present invention implements capabilities such as searching for “all of the words”, “any of the words”, “phrase”, “boolean”. Boolean queries specifically allow users to combine terms using operators such as ‘AND’, ‘OR’, ‘AND NOT’, ‘NEAR’, ‘NEAR/N’ to accurately gather the type of information needed. Each information source, however, is equipped with different levels of capability for searching the information repositories it provides access to. Specifically, the query language syntax may vary widely. For example, in some sources, the search for

<“pancreatic cancer” and “treatment protocol”>

may be expressed as

<+“pancreatic cancer”+“treatment protocol”>

This means that queries provided for federated searching by users need to be “translated” into the native syntax for each source by the brokers 66. This query translation is specified through the broker definition process, and it enables universal searches to be conducted using a single query language against multiple disparate sources.
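A minimal sketch of such a translation, in Python, is given below. It assumes a hypothetical per-source syntax record (`SourceSyntax`) holding the operator symbols that the user would specify in fields 92; the class, its attributes, and the `translate` function are illustrative assumptions rather than the actual broker format, and only the ‘AND’ and ‘OR’ operators are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSyntax:
    """Hypothetical per-source operator symbols captured during broker definition."""
    and_prefix: str = ""        # symbol placed before every required term, if any
    and_joiner: str = " and "   # infix operator for 'all of the words'
    or_joiner: str = " or "     # infix operator for 'any of the words'


def translate(terms, operator, syntax):
    """Render a universal query (terms plus one operator) in a source's syntax."""
    # Multi-word terms are treated as phrases and quoted.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    if operator == "AND":
        if syntax.and_prefix:
            return "".join(syntax.and_prefix + q for q in quoted)
        return syntax.and_joiner.join(quoted)
    if operator == "OR":
        return syntax.or_joiner.join(quoted)
    raise ValueError(f"unsupported operator: {operator}")


TERMS = ["pancreatic cancer", "treatment protocol"]

# A source that uses an infix keyword, and one that uses a '+' prefix.
keyword_source = SourceSyntax()
plus_source = SourceSyntax(and_prefix="+")

print(translate(TERMS, "AND", keyword_source))
print(translate(TERMS, "AND", plus_source))
```

Keeping the operator symbols in data rather than in code reflects the idea that the translation is specified once per source during broker definition and then reused at run time.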

FIG. 18 illustrates a self-contained testing capability within the broker-definition tool 76 that permits a broker that has been created to be tested immediately. The interface pane 78 provides fields 94 for the user to enter a search query, a search type and the number of pages in the results. These fields may also request a user name and password if the source is a secure source. Once these fields are filled in, the user activates the “test” icon, and a testing interface performs a live query against the information source (for which the broker is being defined), just as the federated search function 14 would in the run-time system, gathers the result data, and applies the broker definition to extract the result records and all the defined fields of each result record. This extracted result set is presented in the right pane 80 to give instant feedback to the user on how well the broker definition is working and whether it is ready for deployment to the run-time system.

FIG. 19 illustrates how the broker-definition tool 76 is able to capture security information for a secure source. The broker definition can capture information by multiple security methods including the standard “HTTP basic authentication” and “web-based log-in forms”. As shown in FIG. 19, the interface pane 78 includes a form 98 in which the user can identify the type of security that is being used by the search engine and a field 100 where the user enters the URL or address of the secure information source. Once the log-in page for the secure source is loaded into the right pane 80, the broker-definition tool captures the necessary log-in details, such as navigating the log-in form, logging in, navigating to the search interface, etc., by “watching” (recording) the user's interaction with the information source in the right pane 80. These security credentials will then be stored in the brokers 66 as discussed above. As also discussed above, the security broker 68 during the federated searching function 14 will essentially “replay” the log-in process to connect to the secure information source, and to supply the user's credentials for that source transparently, prior to performing a search. Nuances such as handling session cookies that may be set for each user, by each secure source, are transparently handled by the security broker 68 at run-time.
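The two security methods named above can be illustrated with a short Python sketch. It is offered only as an assumption-laden illustration: `basic_auth_header` builds the standard HTTP basic authentication header, while `replay_login`, the recorded-step format, the `fake_send` transport, and the example URLs are hypothetical stand-ins for whatever the security broker 68 actually stores and replays.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def replay_login(steps, credentials, send):
    """Replay recorded log-in steps, carrying session cookies between them."""
    cookies = {}
    for step in steps:
        # Fill placeholders such as "{username}" with the stored credentials.
        form = {
            field: value.format(**credentials)
            for field, value in step.get("form", {}).items()
        }
        cookies.update(send(step["method"], step["url"], form, dict(cookies)))
    return cookies


def fake_send(method, url, form, cookies):
    """Stand-in transport for this sketch; a real broker would issue HTTP requests."""
    if method == "POST" and form.get("user") == "alice" and form.get("pw") == "s3cret":
        return {"session": "abc123"}
    return {}


# Hypothetical steps captured by "watching" the user log in.
RECORDED_STEPS = [
    {"method": "GET", "url": "https://source.example/login"},
    {"method": "POST", "url": "https://source.example/login",
     "form": {"user": "{username}", "pw": "{password}"}},
    {"method": "GET", "url": "https://source.example/search"},
]

session = replay_login(
    RECORDED_STEPS, {"username": "alice", "password": "s3cret"}, fake_send
)
print(session)
print(basic_auth_header("Aladdin", "open sesame"))
```

Storing the steps with placeholders rather than literal values lets one recorded log-in sequence be replayed for each user's own credentials.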

Following from the above description and invention summaries, it should be apparent to those of ordinary skill in the art that, while the systems and processes herein described constitute exemplary embodiments of the present invention, it is to be understood that the invention is not limited to these precise systems and processes and that changes may be made therein without departing from the scope of the invention as defined by the claims. Additionally, it is to be understood that the invention is defined by the claims and it is not intended that any limitations or elements describing the exemplary embodiments set forth herein are to be incorporated into the meaning of the claims unless such limitations or elements are explicitly listed in the claims. Likewise, it is to be understood that it is not necessary to meet any or all of the identified advantages or objects of the invention disclosed herein in order to fall within the scope of any claims, since the invention is defined by the claims and since inherent and/or unforeseen advantages of the present invention may exist even though they may not have been explicitly discussed herein.

1. A computer implemented method for generating a searchable source broker for defining patterns of search-result information specific to a searchable source, the method comprising the steps of: accessing a given searchable source; performing an example search on the given searchable source to produce search results by that searchable source; identifying regular expressions from the search results.
2. The computer implemented method of claim 1, further comprising the step of storing the regular expressions for the given searchable source for subsequent re-use by a federated search system.
3. The computer implemented method of claim 1, wherein: the step of identifying regular expressions is performed substantially automatically; the method further comprises the step of reviewing, by a user, output of applying the regular expressions to search results produced by the given searchable source; and the method further comprises the step of approving by the user the regular expressions based upon the reviewing step.
4. The computer implemented method of claim 3, wherein the method further includes a step of modifying the regular expressions by the user before the approving step, if the user determines the modifying step is necessary based upon the reviewing step.
5. The computer implemented method of claim 3, wherein the reviewing step involves the step of simultaneously displaying to the user search results produced by the given search and the output of applying the regular expressions to the search results.
6. The computer implemented method of claim 1, wherein the step of identifying regular expressions includes the steps of: distilling a structure of the search results; parsing the search results to distill a structure of the search results; identifying repeating blocks of information from the parsed search results; identifying essential search-result elements from the repeating blocks of information; and generating a regular expression for each identified essential search-result elements and a regular expression for the repeating block.
7. The computer implemented method of claim 6, wherein the essential search-result elements include at least one element taken from a group consisting of: a title; a URL; a date; a key-word; a summary; a passage; and a score.
8. The computer implemented method of claim 2, wherein the accessing step includes the steps of: providing a log-in form, for the searchable source; logging into the searchable source by entering the appropriate log-in information to the log-in form by the user; recording security credential information provided by the user during the logging step; and storing the security credential information with the searchable source broker for re-use by the searchable source broker in the federated search system.